Intro

This report summarises the R-code used to derive the socio-demographic index (SDI) covariate for the GPW13 baseline and projection stream of work. It also provides the input data and generated estimates of SDI together with plots showing the variation of the estimates by country, region and over time. The SDI composite variable generated here is a version of the Socio-demographic Index (SDI) that was originally developed for the 2015 Global Burden of Disease study. It is a development status indicator comprising of components that are strongly correlated with health outcomes and is calculated as the geometric mean of normalised 0 to 1 indices of total fertility rate, mean years of schooling for those aged 15 and older, and lagged income per capita. Although there are SDI versions generated for the GBD2019 study, the input data used there are not publically available e.g. fertility. Thus, we construct the metric using publicly available fertility data from UNDESA World Population Prospects 2019 (WPP2019), as well as GDP per capita and education data from the UNDP Human Development Reports

Input data and SDI calculation

The code below reads in the income, education and fertility data mentioned above.

Next, the raw data are smoothed to reduce random errors and the results are used to generate the SDI variable:

#################################################################################################################
# Function to smooth the historical data for random fluctuations
smoother <- function(df, var, lagy){
  setnames(df, var, "y")
  dfx    <- df %>% mutate(year = year + lagy) %>% 
    select(iso3,year,y) %>% filter(iso3 %in% isos & year>1989) 
  dfy    <- data.table(iso3=character(), year=integer(), y=integer())
  
  for (is in unique(dfx$iso3)){
  dfz    <- dfx %>% filter(iso3 == is) %>% arrange(year)
    if (nrow(dfz) < 2){
        dfy   <- rbind(dfy, dfz)
    } else {
        xi = dfz$year; yi = dfz$y; xp = min(xi):max(xi)
       yp     <- predict(smooth.Pspline(xi, yi,  spar = 1), xp)[,1]
       dfy    <- rbind(dfy, data.table(iso3 = is, year = xp, y = yp))
    }
  }
  dfy %>% setnames(., "y", var) %>% arrange(iso3, year) 
}

#################################################################################################################
#Smoothed data
educ     <- educ.r %>% smoother(.,"educ", 0)
tfr      <- tfr.r  %>% smoother(.,"tfr", 0)
gdp      <- gdp.r  %>% smoother(.,"gdp", 5) %>% mutate(ldi=log(gdp), gdp=NULL) 

#  Issue with Liberia, smooth with lowess
lbr.ldi  <- (gdp %>% filter(iso3 == "LBR") %>% pull(ldi) %>% lowess())$y
gdp      <- gdp %>% mutate(ldi = ifelse(iso3 == "LBR", lbr.ldi, ldi))

#################################################################################################################
# Bring it all together i.e. transform variables to 0-1 scale, calculate SDI
sdi.vars <- educ %>%
  left_join(tfr, by = c("iso3", "year")) %>%
  left_join(gdp, by = c("iso3", "year")) %>%
  left_join(loc_label, by = "iso3") %>%
  select(iso3, year, ldi, tfr, educ, region) %>%
  filter(!(is.na(ldi)|is.na(tfr)|is.na(educ))) %>% 
  mutate(educ.t = (educ - min(educ))/(max(educ) - min(educ)),
         ldi.t  = (ldi  - min(ldi))/(max(ldi) - min(ldi)),
         tfr.t  = 1 - (tfr - min(tfr))/(max(tfr) - min(tfr))) %>% 
  rowwise() %>%
  mutate(sdi.pub = round(prod(educ.t, ldi.t, tfr.t)^(1/3),3)) %>% ungroup() %>%
  filter(year > 1999)

sdi.data <- list(loc_label=loc_label, educ.r=educ.r, tfr.r=tfr.r, gdp.r=gdp.r,
                 educ=educ, tfr=tfr, gdp=gdp, sdi.vars=sdi.vars)

The raw SDI input data, smoothed intermediate files and calculated SDI can be downloaded as an R-database using the sdi.data.RDa link below:

A number of countries are missing at least one input and thus an SDI cannot be estimated for those countries directly. These countries are listed below:

##  [1] "AND" "ATG" "BHS" "BLZ" "BRN" "BTN" "COK" "DMA" "ERI" "FJI" "FSM" "GRD"
## [13] "GUY" "KIR" "KNA" "MCO" "MDV" "MHL" "NIU" "NRU" "PLW" "PNG" "PRK" "SLB"
## [25] "SMR" "SOM" "SSD" "SUR" "SVK" "TLS" "TON" "TUV" "VCT" "VUT" "WSM"

For these missing countries we can impute the SDI according to a regression model. As mentioned previously, a version of the SDI has been generated for the GBD2019 using other data sources that are not publically available. The GBD2019 SDI (sdi.gbd) and the SDI from the public data (sdi.pub) can be compared:

To impute the SDI in the 35 locations listed above, a simple linear regression model is fit to sdi.pub using the sdi.gbd as well as region.

The model estimates are shown in the Table below:

Characteristic Beta 95% CI1 p-value
sdi.gbd 1.1 1.0, 1.1 <0.001
region
Caribbean
Caribbean -0.02 -0.02, -0.01 <0.001
Central America -0.01 -0.02, 0.00 0.102
Central Asia 0.00 -0.01, 0.01 0.464
Eastern Africa -0.05 -0.06, -0.05 <0.001
Eastern Asia -0.04 -0.05, -0.03 <0.001
Eastern Europe -0.02 -0.03, -0.01 <0.001
Middle Africa -0.06 -0.07, -0.06 <0.001
Northern Africa -0.06 -0.07, -0.05 <0.001
Northern America -0.02 -0.03, -0.01 0.004
Northern Europe -0.05 -0.06, -0.04 <0.001
Oceania -0.03 -0.04, -0.01 <0.001
South-Eastern Asia -0.04 -0.05, -0.03 <0.001
South America -0.02 -0.02, -0.01 <0.001
Southern Africa -0.03 -0.04, -0.02 <0.001
Southern Asia -0.01 -0.02, 0.00 0.017
Southern Europe -0.04 -0.05, -0.03 <0.001
Western Africa -0.06 -0.06, -0.05 <0.001
Western Asia -0.06 -0.07, -0.05 <0.001
Western Europe -0.06 -0.07, -0.04 <0.001
R2 = 0.9974; R2 adjusted = 0.9974

1 CI = Confidence Interval

Based on the model diagnostics and assessing the model fit by comparing the fitted to the observed values, we see that in general, the model fits the data well.

The model is used to impute SDI where missing:

Forecasting the SDI covariate

The goal being to potentially forecast indicator values for the GPW13 and where missing, to impute them using the SDI as a covariate, an initial step is to complete the SDI time-series to cover the entire range relevant to the GPW13. This is accomplished by forecasting the SDI for the years applicable to the GPW13 (also back-casting where historical data are non-existent). In this case we forecast the SDI time-series using a random walk (RW2) model which has the density \[\pi(y) \propto exp\left(-\frac{1}{2}\sum_{i=3}^n (y_i - 2y_{i-1} + y_{i-2})^2\right)\] where the term \(y_{i+1} −2y_i +y_{i−1}\) can be interpreted as an estimate of the second order derivative of a continuous time function \(y(t)\) at \(t = i\) using values of \(y(t)\) at \(t = i − 1\), \(i\), and \(i+1\).

In addition to SDI, we also take life-expectancy as a potential covariate. From the UNPOP division WPP2019, life-expectancy at birth has aleady been forecasted for 183 of the 194 member states. We read in the data:

And also provide a list the countries missing \(e_0\):

##  [1] "AND" "COK" "DMA" "KNA" "MCO" "MHL" "NIU" "NRU" "PLW" "SMR" "TUV"

We can impute life-expectancy using the SDI, again utilizing a basic linear model (also noting that we are effectively imposing a collinearity for the countries missing both SDI and \(e_0\)):

The model estimates are shown in the Table below:

Characteristic Beta 95% CI1 p-value
sdi 33 32, 33 <0.001
region
Caribbean
Caribbean 52 51, 52 <0.001
Central America 55 55, 56 <0.001
Central Asia 48 47, 49 <0.001
Eastern Africa 50 49, 50 <0.001
Eastern Asia 53 52, 54 <0.001
Eastern Europe 49 48, 49 <0.001
Middle Africa 45 45, 46 <0.001
Northern Africa 53 52, 54 <0.001
Northern America 52 50, 53 <0.001
Northern Europe 52 51, 53 <0.001
Oceania 52 51, 53 <0.001
South-Eastern Asia 52 51, 53 <0.001
South America 53 52, 54 <0.001
Southern Africa 38 37, 39 <0.001
Southern Asia 54 53, 55 <0.001
Southern Europe 54 53, 55 <0.001
Western Africa 49 49, 49 <0.001
Western Asia 53 52, 54 <0.001
Western Europe 53 52, 54 <0.001
R2 = 0.9975; R2 adjusted = 0.9975

1 CI = Confidence Interval

Finally, we bring all the files together to create the covariate database complete for all countries and years 2000 to 2025:

All of the covariate raw and intermediate files and the final covariates dataframe can be downloaded as an R-database using the covariate.data.RDa link below:

The plot below illustrates the estimated and forecasted SDI by country. Observed values till the year 2017 are represented by the golden circles. The purple lines show the forecast described above with the uncertainty range depicted using the same colour. The plots also show the time-series of life-expectancy at birth \(e_0\) with a dashed series in dark green and the relevant axis to the right:

Maps for 2023

We can also inspect the geographic distribution of the values of these covariates that is forecasted for the 2023. Beginning with SDI:

And also showing life-expectancy at birth: